PyMiner: A method for metabolic pathway design based on the uniform similarity of substrate-product pairs and conditional search

Metabolic pathway design is an essential step in constructing an efficient microbial cell factory to produce high value-added chemicals. Meanwhile, the computational design of biologically meaningful metabolic pathways for producing natural and non-natural products has been attracting much attention. However, effective methods for automatic metabolic network reduction have been lacking. In addition, comprehensive evaluation indexes for metabolic pathways are still relatively scarce. Here, we define a novel uniform similarity to calculate the main substrate-product pairs of known biochemical reactions, and further develop an efficient metabolic pathway design tool named PyMiner. As a result, the redundant information of the general metabolic network (GMN) is eliminated, and the number of substrate-product pairs is shown to decrease by 81.62% on average. Considering that the nodes in the extracted metabolic network (EMN) constructed in this work are large in number but imbalanced in distribution, we establish a conditional search strategy (CSS) that cuts search time in 90.6% of cases. Compared with state-of-the-art methods, PyMiner shows clear advantages and demonstrates equivalent or better performance in 95% of cases of experimentally verified pathways. Consequently, PyMiner is a practical and effective tool for metabolic pathway design.


Step 1 File decompression
pathway design, including the input period, the search period and the evaluation period (S11 Fig). Details shows the detailed information of the pathway selected from Pathways.

Inputs
Sources gives the initial substrates, whose valid input format is compound IDs (e.g., C00079 and C00082).
According to the requirements of the pathway design, more than one substrate can be entered in multi-input mode.
If users are exclusively interested in the heterologous pathways of one specific chassis microorganism, an empty value can be set for this input field. In addition, candidate compounds are automatically suggested below the Sources box as the first characters of a new substrate are typed. Because this is a multi-input field, the Enter key must be used to confirm a new substrate or to completely clear the field.
Target gives the target product, whose valid input format is also a compound ID (e.g., C03582). Potential compounds are automatically listed below the Target box as the first characters of a new target product are typed. Here, only a single input is supported.
Avoid Compounds gives a set of molecules that will be excluded from the extracted metabolic network (EMN). For example, acetyl-CoA (C00024) can be entered in this set to retrieve the methyl-D-erythritol-4-phosphate (MEP) pathway. The multi-input mode of Avoid Compounds is identical to that of Sources.

Avoid Reactions gives a set of biochemical reactions precluded from the EMN. The multi-input mode of Avoid Reactions is the same as that of Sources and Avoid Compounds.
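The effect of Avoid Compounds and Avoid Reactions can be pictured as a filter over the edges of the network. The following is a minimal sketch, not PyMiner's own code: the function name prune_network, the triple representation of the network, and all IDs other than C00024 are assumptions made for illustration.

```python
# Toy network as (substrate, reaction, product) triples.
# Only C00024 (acetyl-CoA) comes from the text; every other ID is a placeholder.
def prune_network(edges, avoid_compounds=(), avoid_reactions=()):
    """Drop every edge that touches an avoided compound or reaction."""
    avoid_c = set(avoid_compounds)
    avoid_r = set(avoid_reactions)
    return [
        (s, r, p)
        for (s, r, p) in edges
        if s not in avoid_c and p not in avoid_c and r not in avoid_r
    ]


toy_edges = [
    ("C_A", "R_1", "C00024"),
    ("C00024", "R_2", "C_B"),
    ("C_A", "R_3", "C_B"),
    ("C_B", "R_4", "C_T"),
]

# Exclude acetyl-CoA and one placeholder reaction before searching.
pruned = prune_network(
    toy_edges, avoid_compounds=["C00024"], avoid_reactions=["R_4"]
)
print(pruned)
```

Any pathway search run on the pruned edge list can no longer pass through the excluded compound or reaction, which is how excluding acetyl-CoA steers the search toward alternative routes.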
Host Organism gives the genome-scale metabolic network model (GSMM) of the chassis microorganism chosen for the production of the target product; the optional values are 'bsu', 'eco', 'kpn', 'llm', 'ppu', 'sce', 'syz' and 'N/A'. All these GSMMs were derived from BiGG [2], and the corresponding microorganisms are described in Table 1. If necessary, GSMMs of other microorganisms (e.g., Corynebacterium glutamicum) can be prepared and added to PyMiner by placing the new models' files in JSON format in the path '../sourcedata/models/' and appending the descriptions of these new models to './datas/models.json'.
MetaCyc [4], and KndPad (this study). The selected database will be distilled to construct an EMN on the first run. The original and distilled information of these databases used for pathway search is saved in the path '../sourcedata/'.
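The two manual steps for adding a new GSMM (copying the model file and appending its description) could be scripted roughly as follows. This is only a sketch under stated assumptions: the layout of './datas/models.json' is not described here, so the code assumes a flat JSON object mapping a model ID to a description, and register_model, 'cgl' and 'cgl.json' are hypothetical names. Check the actual file before adapting it.

```python
import json
import shutil
import tempfile
from pathlib import Path


def register_model(model_file, models_dir, index_file, model_id, description):
    """Copy a GSMM JSON file into models_dir and record it in index_file.

    ASSUMPTION: index_file holds a flat JSON object {model_id: description}.
    The real './datas/models.json' may use a different layout.
    """
    models_dir = Path(models_dir)
    index_file = Path(index_file)
    models_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(model_file, models_dir / Path(model_file).name)

    index = json.loads(index_file.read_text()) if index_file.exists() else {}
    index[model_id] = description
    index_file.write_text(json.dumps(index, indent=2))
    return index


# Demonstration in a throwaway directory, so no real PyMiner file is touched.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    new_model = tmp / "cgl.json"  # hypothetical model file
    new_model.write_text(json.dumps({"id": "cgl", "reactions": []}))
    index = register_model(
        new_model,
        tmp / "models",
        tmp / "models.json",
        model_id="cgl",
        description="Corynebacterium glutamicum",
    )
    copied = (tmp / "models" / "cgl.json").exists()
```

In a real installation the two target paths would be '../sourcedata/models/' and './datas/models.json', as given in the text.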
Maximum Length gives the maximum length (e.g., 4) of the potential pathways to restrict the number of retrieved pathways, while Maximum Times(s) gives the maximum time (e.g., 120s) to restrict the time that can be used for pathway search.
Similarity Difference Threshold gives the threshold ε (e.g., 0.1) used to generate main substrate-product pairs according to the uniform similarity of all candidate substrate-product pairs. A small value of ε corresponds to high atom utilization and high atom conservation in each single-step reaction.
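One way to read the role of ε is as a tolerance around the best-scoring pair of a reaction. The sketch below illustrates that reading only; the exact selection rule and the definition of the uniform similarity are not given in this section, so the rule used here (keep every pair whose similarity lies within ε of the maximum), the function name main_pairs, and the similarity values are all assumptions.

```python
def main_pairs(similarities, epsilon=0.1):
    """Keep pairs whose similarity is within epsilon of the best pair.

    ASSUMED rule for illustration; not necessarily the rule PyMiner applies.
    `similarities` maps (substrate, product) -> similarity score.
    """
    best = max(similarities.values())
    return sorted(
        pair for pair, score in similarities.items() if best - score <= epsilon
    )


# Made-up similarity scores for the candidate pairs of one reaction.
toy = {
    ("S1", "P1"): 0.92,
    ("S1", "P2"): 0.85,
    ("S2", "P1"): 0.40,
    ("S2", "P2"): 0.15,
}

loose = main_pairs(toy, epsilon=0.1)    # tolerant threshold
strict = main_pairs(toy, epsilon=0.05)  # tighter threshold keeps fewer pairs
print(loose, strict)
```

Under this reading, lowering ε retains only the pairs that share the most atoms, which is consistent with the statement that a small ε favours high atom utilization and conservation.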

Search Method gives the pathway search method to be applied: 'bfs_naive' (breadth-first search) or 'dfs_naive' (depth-first search). It is recommended to try 'bfs_naive' first.
If Total is checked, all pathways within Maximum Length will be retrieved; otherwise, only pathways whose length equals Maximum Length will be identified. Similarly, only the shortest pathways to the target product will be accepted if Shortest is checked.
Retro Search lets users choose the direction of the route search, that is, searching from Sources to Target or from Target to Sources. Moreover, Smart Search provides a conditional search strategy (CSS) based on the local total out-degree (LTOD) of the start substrate and the local total in-degree (LTID) of the target product (described in the Materials and methods section).
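The interplay of Maximum Length, Total, Shortest and the search direction can be illustrated on a toy network. The sketch below is not PyMiner's 'bfs_naive' implementation, and choose_direction is not the CSS itself: the CSS rule is defined in the Materials and methods section and is not reproduced here, so plain out-degree and in-degree are used as stand-ins for LTOD and LTID. All function names and the graph are assumptions.

```python
from collections import deque


def bfs_paths(graph, source, target, max_length, total=True):
    """Enumerate simple paths from source to target, breadth first.

    total=True  -> keep paths with at most max_length steps
    total=False -> keep only paths with exactly max_length steps
    """
    found = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        steps = len(path) - 1
        if path[-1] == target:
            if total or steps == max_length:
                found.append(path)
            continue  # do not extend a path beyond the target
        if steps == max_length:
            continue  # length budget used up
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:  # never revisit a compound
                queue.append(path + [nxt])
    return found


def shortest_only(paths):
    """Mimic the Shortest option: keep only the paths of minimal length."""
    if not paths:
        return []
    best = min(len(p) for p in paths)
    return [p for p in paths if len(p) == best]


def choose_direction(graph, source, target):
    """Stand-in for a degree-based direction choice (NOT the actual CSS).

    Searches backwards ('retro') when the target has fewer incoming edges
    than the source has outgoing edges, otherwise forwards.
    """
    out_degree = len(graph.get(source, ()))
    in_degree = sum(target in products for products in graph.values())
    return "retro" if in_degree < out_degree else "forward"


# Toy network: compound -> list of compounds reachable in one reaction.
toy_graph = {"A": ["B", "C", "T"], "B": ["T"], "C": ["B"]}

all_paths = bfs_paths(toy_graph, "A", "T", max_length=3, total=True)
exact_two = bfs_paths(toy_graph, "A", "T", max_length=2, total=False)
shortest = shortest_only(all_paths)
direction = choose_direction(toy_graph, "A", "T")
print(all_paths, exact_two, shortest, direction)
```

Because breadth-first search expands paths level by level, shorter pathways are reported first, which is one plausible reason for trying 'bfs_naive' before 'dfs_naive'.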
If Infeasibility is checked, biologically infeasible pathways (e.g., those whose intermediate metabolites are absent from the selected microorganism) will not be dropped, and all pathways will enter the evaluation stage.
The calculation of atom transfer route or main metabolic flux is a time-consuming process. Therefore, if Atom Trace and Flux are not checked, the potential synthetic pathways of the target product can be retrieved quickly.
If Aerobic Culture is not checked, the main metabolic flux will be calculated under anaerobic conditions. When all inputs are ready, press the Start push-button to start the pathway design cycle.

Pathways
All candidate pathways of the target product (e. Feasibility evaluates the biological feasibility of the corresponding pathway. If 'False' appears in this list cell, a more detailed explanation will be displayed in I-Details and M-Details.
TotalLength is the total length of the corresponding pathway.
EndoLength is the number of endogenous steps of the corresponding pathway.
HeterLength is the number of exogenous steps of the corresponding pathway.
InfLength is the number of biologically infeasible steps of the corresponding pathway.
AtomUtilization is the atom utilization of the initial substrate.
AtomConservation is the atom conservation of the target product.
MetabolicFlux is the main metabolic flux of the corresponding pathway, namely the maximum synthesis rate of the target product.
S-Details gives the string representation (composed of compounds and reactions) of the corresponding pathway.

Tips
Tips prints out all the necessary prompt information (S11 Fig) during the whole cycle of pathway design. During the parameter input phase, the value entered in Sources, Target, Avoid Compounds, or Avoid Reactions is displayed in Tips when the Enter key is pressed. When the Start push-button is pressed, all the pre-set parameters are printed to the Input portion of Tips. During or after the pathway search process, useful information, such as the total number of retrieved pathways, is printed to the Search portion of Tips. Moreover, the Clear push-button can be used to clear the prompt box.

Details
As mentioned above, if one candidate pathway displayed in Pathways is selected, more detailed information will be shown in Details (S10 Fig), including the potential transfer route (highlighted in green) of atoms from the start substrate to the target product, and the compounds and reactions with external links to their corresponding databases (e.g., KEGG, ChEBI [5], Rhea [6] and MetaCyc). Additionally, six types of reactions may be included in the graphical representation of the selected pathway. In detail, a green arrow indicates an endogenous irreversible or reversible reaction, a blue arrow denotes an exogenous irreversible or reversible reaction, and a red arrow indicates a biologically infeasible irreversible or reversible reaction. Furthermore, if users are interested in the detailed transfer route of atoms from the initial substrate to the target product, or in the structure of particular compounds, the mouse wheel can be used to zoom in or out of the figures shown in Details.

Step 4 Getting started with PyMiner
Here, several metabolic pathway design examples are given to illustrate the application of PyMiner.
The exogenous pathways to resveratrol. If users are exclusively interested in the heterologous biosynthetic pathways of resveratrol in E. coli, a null value can be provided to Sources. Compared to S10 Fig, MetaCyc instead of KEGG was adopted here (S2 Fig). In addition to the two reported pathways [7], that is, the second and the third pathways, a new pathway starting from 4-hydroxybenzoate was identified and ranked first. The other inputs employed in this example are illustrated in S2 Fig.
From D-xylose to xylitol. The example in this case study came from MRE [8]. As shown in S3 Fig, only one biosynthesis pathway from D-xylose to xylitol was extracted if Total in the Inputs was not checked. However, if Total is checked, three pathways within 2 steps will be retrieved [9]. The other inputs used in this case study are shown in S3 Fig.
From acetyl-CoA to artemisinate. As an important precursor of the antimalarial drug artemisinin, artemisinate has been used to semi-synthesize artemisinin [10,11]. In this study, twelve pathways in total were identified by PyMiner, and the first one, composed of eight endogenous steps and two exogenous steps (S4 Fig), has been experimentally verified [10,11]. This example indicates that if users have no prior knowledge of the length of the potential pathways, a larger value (e.g., 16) should be provided to Maximum Length, with Shortest checked.
From aldehydo-D-xylose to ethylene glycol. As shown in S9 Fig, when Escherichia coli (eco) was selected to synthesize ethylene glycol from aldehydo-D-xylose, two pathways were retrieved by PyMiner. The length of the second pathway (which was also retrieved by RouteSearch [12]) is 4. Aldehydo-D-xylose, the sole carbon source, is missing in Escherichia coli and should be added to the culture medium [13]. Furthermore, Infeasibility was checked in order to retrieve these two pathways. The other inputs used in this case are shown in S9 Fig.